BLAST and formatdb

blastall and formatdb

usage on PBIL:

formatdb -p F -o F -a F -i xxx.cds
blastall -p blastn -d xxx.cds -i yyy.cds -o output -F F -v 1 -b 1 -e 1e-14 -a 1

formatdb

blastall

More


Formatdb README
------------------

Table of Contents

Introduction

Command Line Options

Configuration File

Formatdb Notes/Troubleshooting

A The -o option and identifiers
B "SORTFiles failed" message
C Formatting large FASTA files
D Piping a database to formatdb without uncompressing
E Creating custom databases.
F General troubleshooting tips.
G "SeqIdParse Failure" error
H "FileOpen" error

Appendix 1: The Files Produced by Formatdb

Introduction
------------
Formatdb must be used in order to format protein or nucleotide source
databases before these databases can be searched by blastall, blastpgp
or MegaBLAST. The source database may be in either FASTA or ASN.1
format. Although the FASTA format is most often used as input to
formatdb, the use of ASN.1 is advantageous for those who are using
ASN.1 as the common source for other formats such as the GenBank
report. Once a source database file has been formatted by formatdb it
is not needed by BLAST. Please note that formatdb does not create
non-redundant blast databases.

If you are having problems formatting a BLAST databases please scroll
down to the "Formatdb Notes/Troubleshooting" section below. Or contact
the BLAST Desk at blast-help@ncbi.nlm.nih.gov


Command Line Options
--------------------
A list of the command line options and the current version for formatdb may 
be obtained by executing formatdb without options, as in:

formatdb -

The formatdb options are summarized below:

formatdb 2.2.5 arguments:

-t Title for database file [String] 
Optional
-i Input file(s) for formatting (this parameter must be set)
[File In]
-l Logfile name: [File Out] 
Optional
default = formatdb.log
-p Type of file
T - protein
F - nucleotide [T/F] Optional
default = T

-o Parse options
T - True: Parse SeqId and create indexes.
F - False: Do not parse SeqId. Do not create indexes.
[T/F] Optional default = F

If the "-o" option is TRUE (and the source database is in FASTA
format), then the database identifiers in the FASTA definition
line must follow the convention of the FASTA Defline Format.
Please see section "F Note on creating custom databases"
below.

-a Input file is database in ASN.1 format (otherwise FASTA is expected)
T - True,
F - False.
[T/F] Optional default = F

-b ASN.1 database in binary mode
T - binary,
F - text mode.
[T/F] Optional default = F

A source ASN.1 database may be represented in two formats -
ascii text and binary. The "-b" option, if TRUE, specifies that
input ASN.1 database is in binary format. The option is ignored
in case of FASTA input database.


-e Input is a Seq-entry [T/F] 
Optional
default = F

A source ASN.1 database (either text ascii or binary) may
contain a Bioseq-set or just one Bioseq. In the latter case the
"-e" switch should be set to TRUE.

-n Base name for BLAST files [String] 
Optional

This options allows one to produce BLAST databases with a
different name than that of the original FASTA file. For
instance, one could have a file named 'ecoli.nuc.txt' and and
format it as 'ecoli':

formatdb -i ecoli.nuc.txt -p F -o T -n ecoli

uncompress -c nr.z | formatdb -i stdin -o T -n nr

This can be used in situations where the original FASTA file is
not required other than by formatdb. This can help in a
situation where disk-space is tight.

-v Database volume size in millions of letters [Integer] Optional
default = 0
range from 0 to

This option breaks up large FASTA files into 'volumes' (each
with a maximum size of 2 billion letters). As part of the
creation of a volume formatdb writes a new type of BLAST
database file, called an alias file, with the extension 'nal'
or 'pal'.

-s Create indexes limited only to accessions - sparse [T/F] 
Optional
default = F

This option limits the indices for the string identifiers (used
by formatdb) to accessions (i.e., no locus names). This is
especially useful for sequences sets like the EST's where the
accession and locus names are identical. Formatdb runs faster
and produces smaller temporary files if this option is used.
It is strongly recommended for EST's, STS's, GSS's, and
HTGS's.

-L Create an alias file with this name
use the gifile arg (below) if set to calculate db size
use the BLAST db specified with -i (above) [File Out] Optional

This option produces a BLAST database alias file using a specified
database, but limiting the sequences searched to those in the GI list
given by the -F argument. See the section "Note on creating an alias file 
for a GI list" for more information.

-F Gifile (file containing list of gi's) [File In] Optional

This option can be used to specify the GI list for the alias file
construction (-L option above) or to produce a binary GI list if
the -B option (below) is set.

-B Binary Gifile produced from the Gifile specified above [File Out] Optional

This option specifies the name of a binary GI list file. This option should
be used with the -F option. A text GI list may be specified with the -F
option and the -B option will produce that GI list in binary format. The
binary file is smaller and BLAST does not need to convert it, so it can
be read faster.


Homepage
Comments

Hide Comments